Conversation
daijh (Contributor) commented Nov 19, 2025

Description

This PR optimizes the Conv operation by implementing two new compute shaders: oihw_to_ohwi and im2col-matmul.

oihw_to_ohwi:
Improves performance over the default Transpose shader by using workgroup memory to keep both reads and writes contiguous (coalesced); see the sketch below.
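
For intuition, here is a minimal sketch of the technique, not the PR's actual shader: the tile size, binding layout, and f32 element type are assumptions. Viewing the OIHW kernel as [O, I, H*W], each workgroup stages a tile of one [I, H*W] slice in workgroup memory so that both the global read and the global write touch consecutive addresses:

```wgsl
// Minimal sketch, not the PR's shader: OIHW -> OHWI viewed as a batched
// transpose [O, I, HW] -> [O, HW, I]. Tile size and bindings are assumptions.
// Host dispatches ceil(cols/TILE) x ceil(rows/TILE) x O workgroups.
const TILE : u32 = 16u;
var<workgroup> tile : array<array<f32, TILE>, TILE>;

struct Uniforms {
  rows : u32,  // I
  cols : u32,  // H * W
}
@group(0) @binding(0) var<storage, read> src : array<f32>;
@group(0) @binding(1) var<storage, read_write> dst : array<f32>;
@group(0) @binding(2) var<uniform> u : Uniforms;

@compute @workgroup_size(TILE, TILE)
fn main(@builtin(workgroup_id) wid : vec3<u32>,
        @builtin(local_invocation_id) lid : vec3<u32>) {
  let base = wid.z * u.rows * u.cols;  // one output channel (o) per z-layer
  // Coalesced read: adjacent invocations (lid.x) read adjacent columns.
  let in_row = wid.y * TILE + lid.y;
  let in_col = wid.x * TILE + lid.x;
  if (in_row < u.rows && in_col < u.cols) {
    tile[lid.y][lid.x] = src[base + in_row * u.cols + in_col];
  }
  workgroupBarrier();
  // Coalesced write: the coordinate swap happens inside the tile, so the
  // global store still walks consecutive addresses.
  let out_row = wid.x * TILE + lid.y;
  let out_col = wid.y * TILE + lid.x;
  if (out_row < u.cols && out_col < u.rows) {
    dst[base + out_row * u.rows + out_col] = tile[lid.x][lid.y];
  }
}
```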

im2col-matmul:

  • Employs a workgroup size of 64.
  • Dynamically selects the tile size (32x64 or 16x64) based on the source and weight shapes.
  • Each invocation handles a dedicated weight element.
  • Uses subgroupShuffle to access the source tile efficiently, leveraging k_vec4 vectorization for better memory throughput (see the sketch below).
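
To make the access pattern concrete, here is a simplified version of the inner loop. It mirrors the snippets discussed in the review below, assumes sg_size == TILE_M_SIZE == 32, and omits tile loading and edge handling:

```wgsl
// Simplified inner loop (assumes sg_size == TILE_M_SIZE == 32; tile loading
// and edge handling omitted). K is packed as vec4 (k_vec4), so each dot()
// consumes four K-values at once.
for (var inner_k_idx = 0u; inner_k_idx < TILE_K_VEC_SIZE; inner_k_idx++) {
  // Each invocation owns one weight element's column of the output tile.
  let weight_data = weight_tile[inner_k_idx][local_idx];
  // Each lane holds one source row; subgroupShuffle broadcasts row m_idx to
  // all lanes without another round-trip through workgroup memory.
  let src_data = src_tile[inner_k_idx][sg_id];
  for (var m_idx = 0u; m_idx < TILE_M_SIZE; m_idx++) {
    results[m_idx] += output_element_t(dot(weight_data, subgroupShuffle(src_data, m_idx)));
  }
}
```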

Testing on Lunar Lake demonstrated up to an 87% performance improvement in Conv_2D operations.

Motivation and Context

See above.

daijh (Contributor Author) commented Nov 19, 2025

Lunar Lake
onnxruntime commit d55ade0

| Operation | conv2d-mm (ms) | im2col-matmul (ms) |
| --- | --- | --- |
| src: 1x128x512x512, weight: 128x128x3x3 | 56.071 | 42.824 |
| src: 1x2560x8x8, weight: 1280x2560x3x3 | 21.066 | 11.263 |
| src: 1x1280x8x8, weight: 1280x1280x3x3 | 10.384 | 6.357 |

sd-turbo

| Model | conv2d-mm (ms) | im2col-matmul (ms) |
| --- | --- | --- |
| sd-turbo-unet-fp16-demo.onnx | 1010.245 | 612.092 |
| sd-turbo-vae-decoder-fp16-demo.onnx | 2317.391 | 1848.545 |

daijh (Contributor Author) commented Nov 19, 2025

@guschmue @fs-eire @qjia7 PTAL.

guschmue added the ep:WebGPU (ort-web webgpu provider) label on Nov 21, 2025
const uint32_t kernel_height = onnxruntime::narrow<uint32_t>(kernel_shape[2]);
const uint32_t kernel_width = onnxruntime::narrow<uint32_t>(kernel_shape[3]);

TensorShape nhwc_kernel_shape{channel_output, kernel_height, kernel_width, channel_input};
Contributor:
Suggested change:

```diff
- TensorShape nhwc_kernel_shape{channel_output, kernel_height, kernel_width, channel_input};
+ TensorShape ohwi_kernel_shape{channel_output, kernel_height, kernel_width, channel_input};
```

daijh (Contributor Author):

Done

const uint32_t kernel_width = onnxruntime::narrow<uint32_t>(kernel_shape[3]);

TensorShape nhwc_kernel_shape{channel_output, kernel_height, kernel_width, channel_input};
Tensor nhwc_kernel = context.CreateGPUTensor(kernel->DataType(), nhwc_kernel_shape);
Contributor:

Suggested change:

```diff
- Tensor nhwc_kernel = context.CreateGPUTensor(kernel->DataType(), nhwc_kernel_shape);
+ Tensor ohwi_kernel = context.CreateGPUTensor(kernel->DataType(), nhwc_kernel_shape);
```


const uint32_t M_tiles = ceil_div(im2col_m, tile_m);
const uint32_t N_tiles = ceil_div(im2col_n, tile_n);
im2col_mm_program.SetDispatchGroupSize(M_tiles, N_tiles, batch);
Contributor:

How about enhancing the current TransposeProgram with a shared-memory path instead of adding a new shader?
You are transposing from perm [0, 1, 2, 3] to perm [0, 2, 3, 1], which is equivalent to transposing [o, i, hw] to [o, hw, i]. You could simply extend DoTranspose with a shared-memory path that supports any shape where only the last two dimensions are transposed and the preceding dimensions are unchanged. Currently, the shared path only supports 2D transpose from perm [0, 1] to perm [1, 0]. We can extend it to transpose [0, 1, 2] to [0, 2, 1]: whenever a transpose only swaps the last two dimensions, reshape the tensor into a 3D tensor [d0 * d1 * ... * dn-3, dn-2, dn-1] and run the shared path on that; see the sketch below.
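
Schematically, the reshape idea could look like the following sketch (assumed names, not ONNX Runtime's actual DoTranspose code); the shared-memory tiling then proceeds exactly as in the 2D case, with one tile grid per batch slice:

```wgsl
// Sketch with assumed names: map a flat source offset in the reshaped view
// [batch, rows, cols] (batch = d0 * d1 * ... * d(n-3)) to the destination
// offset in [batch, cols, rows].
fn transposed_offset(flat : u32, rows : u32, cols : u32) -> u32 {
  let b = flat / (rows * cols);
  let r = (flat % (rows * cols)) / cols;
  let c = flat % cols;
  return b * rows * cols + c * rows + r;
}
```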

daijh (Contributor Author):

Understood.
I intend to improve the current Transpose path, as discussed in the previous PR #26501.
Could I handle this as a separate task in a follow-up PR?

for (var inner_k_idx = 0u; inner_k_idx < TILE_K_VEC_SIZE; inner_k_idx++) {
let weight_data = weight_tile[inner_k_idx][local_idx];
#if use_subgroup
let src_data = src_tile[inner_k_idx][sg_id];
Contributor:

What if sg_size is larger or smaller than TILE_M_SIZE?

daijh (Contributor Author):

Currently, Lunar Lake devices support a subgroup size of 32.
We must take care when adding support for devices with different subgroup sizes, passing the size as a parameter to the shader template; see the sketch below.
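
One possible mechanism for that, sketched below with WGSL pipeline-overridable constants (an assumption on my part; the shader-template substitution mentioned above would work equally well): the host bakes the queried subgroup size in at pipeline creation, so the branch is resolved before execution.

```wgsl
// Sketch of one specialization mechanism (overridable constants; the PR uses
// a shader template instead). The host sets SG_SIZE from the adapter's
// reported subgroup size when creating the pipeline.
enable subgroups;

override SG_SIZE : u32 = 32u;

fn dot_shuffled(weight_data : vec4<f32>, src_data : vec4<f32>, m_idx : u32) -> f32 {
  if (SG_SIZE == 32u) {
    // Resolved at pipeline-creation time, so no per-invocation branch cost.
    return dot(weight_data, subgroupShuffle(src_data, m_idx));
  }
  return dot(weight_data, src_data);  // placeholder fallback for other sizes
}
```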

daijh (Contributor Author):

BTW, using subgroupShuffle improves performance by 5%~10% compared to not using subgroupShuffle.

Contributor:

If you only want to support sg_size = 32, I would prefer you write it like below. Otherwise, you can't be sure that sg_size is 32 unless you check that the supported subgroup range satisfies subgroupMinSize = subgroupMaxSize = 32.

```wgsl
if (sg_size == 32) {
  let src_data = src_tile[inner_k_idx][sg_id];
  for (var m_idx = 0u; m_idx < TILE_M_SIZE; m_idx++) {
    results[m_idx] += output_element_t(dot(weight_data, subgroupShuffle(src_data, m_idx)));
  }
} else {
  for (var m_idx = 0u; m_idx < TILE_M_SIZE; m_idx++) {
    results[m_idx] += output_element_t(dot(weight_data, src_tile[inner_k_idx][m_idx]));
  }
}
```

daijh (Contributor Author):

My understanding is that most modern GPUs support a subgroup size of 32 or greater, which is what this shader is designed for.

The compile-time guard exists to mitigate the performance penalty of a runtime conditional check like the following inside the shader:

if (sg_size == 32) {
  // do something
}

Testing on Lunar Lake showed these penalties to be high.

I will add a comment in the code about this. Alternatively, I can simply make use_subgroup=false by default.
